Anton Kulesh, Data Scientist
Datathon
Minsk, July 2019
In this notebook we play with different tools for interpreting machine learning models. We solve a binary classification problem: we predict the probability of customer churn and explain the predictions of a black-box model. As the black-box model we use gradient boosting on decision trees (XGBoost).
We'll look at the following tools:
- LIME
- ELI5
- SHAP
- InterpretML
and compare several feature importance methods:
- Gain
- Splits count
- Coverage
- Permutation
- mean(|shap_values|)
Ok, let's go!
import os
import warnings
warnings.filterwarnings("ignore")
# LIME
import lime
from lime.lime_tabular import LimeTabularExplainer
from lime import submodular_pick
#
# ELI5
import eli5
from eli5.sklearn import PermutationImportance
#
# SHAP
import shap
#
# InterpretML
import interpret
from interpret.data import ClassHistogram
from interpret.perf import ROC
from interpret.blackbox import ShapKernel, LimeTabular, MorrisSensitivity, PartialDependence
from interpret.glassbox import ExplainableBoostingClassifier, LogisticRegression
#
# Data manipulation and modeling
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
import xgboost as xgb
from xgboost import XGBClassifier
#
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
# For plotting nice SHAP graphs
shap.initjs()
Each row represents a customer, each column contains a customer attribute. The data set includes information about:
- Churn -- customers who left within the last month (target variable)
- Services that each customer has signed up for: phone (PhoneService), multiple lines (MultipleLines), internet (InternetService), online security (OnlineSecurity), online backup (OnlineBackup), device protection (DeviceProtection), tech support (TechSupport), streaming TV (StreamingTV) and movies (StreamingMovies)
- Customer account information: how long they've been a customer (tenure), contract (Contract), payment method (PaymentMethod), paperless billing (PaperlessBilling), monthly charges (MonthlyCharges), and total charges (TotalCharges)
- Demographic info: gender (gender), age range (SeniorCitizen), and whether they have partners (Partner) and dependents (Dependents)
After loading the data we'll:
- drop the customerID column
- convert TotalCharges to float
- binarize gender and the Yes/No columns
- ordinal-encode the remaining categorical columns
path_to_data = "./data/telco-customer-churn.zip"
data = pd.read_csv(path_to_data, compression="zip")
del data["customerID"]
data['TotalCharges'] = data['TotalCharges'].replace(" ", 0).astype('float32')
data['gender'] = data['gender'].apply(lambda x: 1 if x == "Female" else 0)
bool_columns = ['Partner', 'Dependents', 'PhoneService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies',
'PaperlessBilling', 'Churn'
]
for col in bool_columns:
data[col] = data[col].apply(lambda x: 1 if x == "Yes" else 0)
columns = data.columns.to_list()[:-1]
categorical_names = ['MultipleLines', 'InternetService', 'Contract', 'PaymentMethod']
categorical_columns = [columns.index(col) for col in categorical_names]
encoder = OrdinalEncoder()
data[categorical_names] = encoder.fit_transform(data[categorical_names])
categorical_names_dict = {col: list(values) for col, values in zip(categorical_columns, encoder.categories_)}
def categorical_decode(data, idx):
    # Map the ordinal codes of row `idx` (2D numpy array input) back to
    # the original category labels via the fitted OrdinalEncoder
    return dict(zip(
        categorical_names,
        encoder.inverse_transform(
            data[idx, categorical_columns].reshape(1, -1))[0]
    ))
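For example, a quick sanity check that decodes the first row (the exact labels depend on the loaded data):
# Decode row 0's categorical codes back to their original labels
categorical_decode(data.values, 0)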
data.info()
Let's look at the prepared data sample
data.head().T
Let's look at the target variable distribution
target = "Churn"
data[target].plot(kind="hist");
We can observe some class imbalance, but it's not a big problem here.
data[target].value_counts(1)
We omit the feature engineering phase, as it is not needed for our goals.
Let's simply split our data into train (70%, for model training) and test (30%, for model validation) sets.
X_data = data.copy()
y_data = X_data.pop(target)
X_train, X_test, y_train, y_test = train_test_split(X_data.values, y_data.values, test_size=0.3, random_state=42)
print("Train size: {}".format(X_train.shape))
print("Test size: {}".format(X_test.shape))
clf = XGBClassifier(max_depth=3, n_estimators=50, importance_type="gain")
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]
print("ROC-AUC: {}".format(roc_auc_score(y_test, y_proba)))
Let's take a closer look at the first tree from our ensemble.
booster = clf.get_booster()
original_feature_names = booster.feature_names
booster.feature_names = columns
print(booster.get_dump()[0])
booster.feature_names = original_feature_names
We put our predictions and true labels in one table, then convert the predicted probabilities into binary labels (threshold = 0.5).
results = pd.DataFrame(np.vstack((y_test, y_proba)).T, columns=["y_test", "y_proba"])
results["y_test"] = results["y_test"].astype("bool")
results["y_pred"] = results["y_proba"] >= 0.5
results["error"] = results['y_proba'] - results['y_test']
results["abs_error"] = abs(results['y_proba'] - results['y_test'])
results.sort_values('error', ascending=False, inplace=True)
Let's look at the classification report
print(classification_report(results["y_test"], results["y_pred"]))
Now let's compare the performance of our "black box" with a "dummy" model that generates predictions by respecting the training set's class distribution.
dummy = DummyClassifier(strategy="stratified")
dummy.fit(X_train, y_train)
y_proba_dummy = dummy.predict_proba(X_test)[:, 1]
print("ROC-AUC: {}".format(roc_auc_score(y_test, y_proba_dummy)))
print(classification_report(y_test, y_proba_dummy >= 0.5))
Not bad! The model shows quite promising results even without smart feature engineering or hyperparameter tuning. So let's try to give some explanations and crack our black box open.
The lime package supports models with different input types:
- LimeTabularExplainer -- working with tabular data
- RecurrentTabularExplainer -- working with time series
- LimeTextExplainer -- working with text data inputs
- LimeImageExplainer -- explains predictions on images
For our task we use LimeTabularExplainer. Let's create an explainer (a LimeTabularExplainer class object) and get explanations for some examples (explainer.explain_instance).
def create_explainer(data, **kwargs):
explainer = LimeTabularExplainer(
data, feature_names=columns, categorical_features=categorical_columns,
categorical_names=categorical_names_dict, class_names=["no churn", "churn"],
discretize_continuous=True, **kwargs)
return explainer
Let's look at the top 10 features and their contributions
i = 2
explainer = create_explainer(X_train, verbose=True, kernel_width=None)
exp = explainer.explain_instance(X_test[i], predict_fn=clf.predict_proba, num_samples=1000,
num_features=10, model_regressor=None)
print("Predicted_label: [{}]".format(int(y_proba[i] >= 0.5)))
print("True_label: [{}]".format(y_test[i]))
exp_df = pd.DataFrame(exp.as_list(), columns=['feature_value', 'feature_contribution'])
exp_df
Let's sum the weights and compare with the Prediction_local value above. We can see that it's exactly the same value.
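This works because LIME's local surrogate is a linear model, and for the instance being explained all the active interpretable (discretized/binarized) features equal 1, so

$$ \text{Prediction\_local} = b + \sum_i w_i, $$

where $b$ is exp.intercept[1] and the $w_i$ are the feature contributions listed above.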
exp_df.feature_contribution.sum() + exp.intercept[1]
Detailed explanation
exp.show_in_notebook(show_table=True, show_all=False)
Feature contributions
exp.as_pyplot_figure();
Descriptive statistics of numerical features
num_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']
X_data[num_columns].describe()
We can also save our explanation as an HTML file
exp.save_to_file("./data/demo_0.html")
i = 0
exp = explainer.explain_instance(X_test[i], predict_fn=clf.predict_proba, num_samples=1000,
num_features=10, model_regressor=None)
print("Predicted_label: [{}]".format(int(y_proba[i] >= 0.5)))
print("True_label: [{}]".format(y_test[i]))
exp.as_pyplot_figure();
This technique (submodular pick, SP-LIME) selects a small set of representative instances whose explanations together cover the model's important features with little redundancy.
explainer = create_explainer(X_train)
%time sp_explanations = submodular_pick.SubmodularPick(explainer, X_test, clf.predict_proba, method="full")
for i, explanation in zip(sp_explanations.V, sp_explanations.sp_explanations):
print("\nExplanation #{}".format(i))
print("\nIntercept: {}".format(explanation.intercept))
print("\nLocal_prediction: {}".format(explanation.local_pred[0]))
print("\nModel_prediction: {}".format(y_proba[i]))
print("\nTrue_label: {}".format(y_test[i]))
explanation.show_in_notebook();
We can observe that for some examples the local prediction has high residuals. This looks like a weakness of the LIME approach; it may be related to the neighbourhood definition and local kernel settings. In general, we could try to pick optimal parameters for our explainer, as sketched below, but a full search is out of scope for this notebook.
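As a minimal sketch of such a sweep (hypothetical width values; kernel_width=None falls back to LIME's default of sqrt(n_features) * 0.75), we could compare the surrogate's local prediction against the model's prediction for several kernel widths:
# Hypothetical kernel-width sweep: smaller widths give more local surrogates
i = 0
for kw in [None, 1, 3, 5]:
    explainer_kw = create_explainer(X_train, kernel_width=kw)
    exp_kw = explainer_kw.explain_instance(X_test[i], predict_fn=clf.predict_proba,
                                           num_samples=1000, num_features=10)
    print("kernel_width={}: local_pred={:.3f}, model_pred={:.3f}".format(
        kw, exp_kw.local_pred[0], y_proba[i]))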
i = 0
eli5.show_prediction(clf, X_test[i],
feature_names=columns,
show_feature_values=True,
show=["targets",
"feature_importances",
"description",
"method"])
eli5.explain_prediction_df(clf, X_test[i], feature_names=columns)
eli5.show_weights(clf, feature_names=columns)
It's just a nice way to show the model's feature importances. Our XGBoost model uses gain here (importance_type="gain"), i.e. the average gain of the splits which use the feature.
pd.DataFrame.from_records(
data=zip(columns, clf.feature_importances_),
columns=["name", "importance"]).sort_values(by="importance", ascending=False)
ELI5 also allows us to use Permutation Importance technique, and we will mention it in "Global Interpretability" section.
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
see repo for details: https://github.com/slundberg/shap
Efficient implementation of SHAP for tree models: https://arxiv.org/pdf/1802.03888.pdf.
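The key local accuracy property behind these values: the attributions $\phi_i$ sum to the difference between the model's prediction and the expected prediction over the background data,

$$ f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i, \qquad \phi_0 = E\left[f(X)\right]. $$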
Let's create an explainer and calculate the SHAP values
explainer = shap.TreeExplainer(clf, X_train, model_output="probability", feature_dependence="independent")
%time shap_values = explainer.shap_values(X_test[:500])
SHAP values of the first sample
shap_values[0]
print("Model prediction: {}".format(y_proba[0]))
print("Sum of SHAP values + base values: {}".format(sum(shap_values[0]) + explainer.expected_value))
print("Expected value: {}".format(explainer.expected_value))
print("Average prediction: {}".format(clf.predict_proba(X_train)[:, 1].mean(0)))
i = 0
shap.force_plot(base_value=explainer.expected_value, shap_values=shap_values[i,:], features=X_test[i,:], feature_names=columns)
shap.force_plot(explainer.expected_value, shap_values[:500,:], X_test[:500,:], feature_names=columns)
[..] To get an overview of which features are most important for a model we can plot the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature value (red high, blue low).
shap.summary_plot(shap_values, X_test[:500], feature_names=columns)
churn_mask = y_test[:500] == 1
no_churn_mask = y_test[:500] == 0
print("Customers in churn: {}".format(sum(churn_mask)))
print("Customers is alive: {}".format(sum(no_churn_mask)))
shap.summary_plot(shap_values[churn_mask], X_test[:500][churn_mask], feature_names=columns)
shap.summary_plot(shap_values[no_churn_mask], X_test[:500][no_churn_mask], feature_names=columns)
shap.summary_plot(shap_values, X_test[:500], plot_type="bar", feature_names=columns)
[..] SHAP dependence plots show the effect of a single feature across the whole dataset. They plot a feature's value vs. the SHAP value of that feature across many samples. SHAP dependence plots are similar to partial dependence plots, but account for the interaction effects present in the features, and are only defined in regions of the input space supported by data. The vertical dispersion of SHAP values at a single feature value is driven by interaction effects, and another feature is chosen for coloring to highlight possible interactions.
Below we create dependence plots for the 7 most important features
feature_importance = pd.DataFrame.from_records(
data=list(zip(columns, np.mean(np.abs(shap_values), axis=0))),
columns=["name", "importance"]
)
feature_importance.sort_values(by="importance", ascending=False, inplace=True)
n_top = 7
for col in feature_importance.name[:n_top]:
shap.dependence_plot(col, shap_values, X_test[:500], feature_names=columns)
[..] Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature. The computed importance values are Shapley values from game theory and also coefficients from a local linear regression.
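For reference, the regression weight that Kernel SHAP assigns to a coalition $z'$ with $|z'|$ active features out of $M$ is the Shapley kernel (Lundberg & Lee, 2017):

$$ \pi_{x}(z') = \frac{M-1}{\binom{M}{|z'|}\,|z'|\,\left(M-|z'|\right)}. $$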
model = lambda x: clf.predict_proba(x)[:, 1]
We apply k-means to summarize the training data into 100 representative points and use them as the background dataset.
background = shap.kmeans(X_train, 100)  # avoid shadowing the `data` DataFrame
kernel_explainer = shap.KernelExplainer(model, background, feature_names=columns)
print("Expected values (KernelExplainer): {}".format(kernel_explainer.expected_value))
print("Expected values (TreeExplainer): {}".format(explainer.expected_value))
%time shap_values_kernel = kernel_explainer.shap_values(X_test[0])
KernelExplainer values
shap_values_kernel
shap.force_plot(base_value=float(kernel_explainer.expected_value),
shap_values=shap_values_kernel,
feature_names=columns)
TreeExplainer values
shap_values[0, :]
shap.force_plot(base_value=explainer.expected_value,
shap_values=shap_values[0],
feature_names=columns)
mean_squared_error(shap_values[0, :], shap_values_kernel)
Microsoft Research has developed an algorithm called the Explainable Boosting Machine (EBM) which has both high accuracy and intelligibility. EBM uses modern machine learning techniques like bagging and boosting to breathe new life into traditional GAMs (Generalized Additive Models). This makes them as accurate as random forests and gradient boosted trees, and also enhances their intelligibility and editability.
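For reference, an EBM fits a model of the GAM form, optionally extended with pairwise interaction terms:

$$ g\big(E[y]\big) = \beta_0 + \sum_i f_i(x_i) + \sum_{(i,j)} f_{ij}(x_i, x_j), $$

where $g$ is the link function (logit for classification) and each $f_i$ is learned with bagging and gradient boosting.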
This tool looks very promising. It has nice functionality but is still under development.
see repo: https://github.com/microsoft/interpret
hist = ClassHistogram(feature_names=columns).explain_data(X_train, y_train, name="EDA")
interpret.show(hist)
ebc = ExplainableBoostingClassifier(feature_names=columns, random_state=42)
%time ebc.fit(X_train, y_train)
n = 10
ebc_local = ebc.explain_local(X_test[:n], y_test[:n], name="EBC")
interpret.show(ebc_local)
i = 0
categorical_decode(X_test, i)
ebc_global = ebc.explain_global(name="EBC")
interpret.show(ebc_global)
ebc_roc_curve = interpret.perf.ROC(
predict_fn=ebc.predict_proba).explain_perf(X_test, y_test, name='Churn detection (EBC)')
interpret.show(ebc_roc_curve)
Add one more model
lr = LogisticRegression(feature_names=columns)
lr.fit(X_train, y_train)
lr_global = lr.explain_global(name="LR")
lr_local = lr.explain_local(X_test[:n], y_test[:n], name="LR")
lr_roc_curve = interpret.perf.ROC(
predict_fn=lr.predict_proba).explain_perf(X_test, y_test, name='Churn detection (LR)')
Let's look at the entire picture
interpret.show(
explanation=[hist, ebc_roc_curve, ebc_global, ebc_local, lr_roc_curve, lr_global, lr_local],
share_tables=True
)
Explanation of tabular data using LIME (LimeTabular)
n = 10
lime_explainer = LimeTabular(predict_fn=clf.predict_proba,
data=X_train,
feature_names=columns,
random_state=42
)
%time lime_local = lime_explainer.explain_local(X_test[:n], y_test[:n], name='LIME (LimeTabular)')
Currently interpret supports only the kernel-based SHAP explainer, ShapKernel (a.k.a. Kernel SHAP)
background_data = shap.kmeans(X_train, 100).data
shap_explainer = ShapKernel(predict_fn=clf.predict_proba, data=background_data, feature_names=columns)
shap_local = shap_explainer.explain_local(X_test[:n], y_test[:n], name='SHAP (ShapKernel)')
interpret.show([lime_local, shap_local], share_tables=True)
def feature_importances_table(feature_importances_array, feature_names=None, n_top=None):
    # Build a (feature_name, importance) table sorted ascending, so the
    # n_top most important features end up at the tail (handy for barh plots)
    if feature_names is None:
        feature_names = X_data.columns
    if n_top is None:
        n_top = len(feature_importances_array)
    feature_importances = list(zip(feature_names, feature_importances_array))
    feature_importances = pd.DataFrame.from_records(
        feature_importances, columns=["feature_name", "importance"])
    feature_importances.sort_values("importance", inplace=True, ascending=True)
    feature_importances = feature_importances[-n_top:]
    return feature_importances
def show_feature_importances(feature_importances, feature_names=None, n_top=None):
if not isinstance(feature_importances, pd.DataFrame):
feature_importances = feature_importances_table(
feature_importances, feature_names=feature_names, n_top=n_top)
ax = feature_importances.plot.barh(
x="feature_name", y="importance", figsize=(10, 8),
fontsize=16, color='green', alpha=0.8)
ax.grid()
plt.title("Feature importances", fontsize=20)
plt.show()
def calculate_feature_saturation(feature_importances, model=None,
                                 feature_names=None, n_top=None, show=True, label=None):
    # Retrain the model on the top-i features (by the given importance
    # ranking) and record ROC AUC as i grows
    if model is None:
        model = XGBClassifier(max_depth=3, n_estimators=50)
    if not isinstance(feature_importances, pd.DataFrame):
        feature_importances = feature_importances_table(
            feature_importances, feature_names=feature_names, n_top=n_top)
    score = {}
    # +1 so the full feature set is evaluated as well
    for i in range(1, len(feature_importances) + 1):
        features = feature_importances[-i:]["feature_name"]
        column_mask = X_data.columns.isin(features)
        model.fit(X_train[:, column_mask], y_train)
        y_proba = model.predict_proba(X_test[:, column_mask])[:, 1]
        score[i] = roc_auc_score(y_test, y_proba)
    return score
def show_feature_saturation(scores, labels, colors, show=True):
plt.figure(figsize=(12, 6))
for score, label, color in zip(scores, labels, colors):
x = list(score.keys())
y = list(score.values())
plt.plot(x, y, "-*", color=color, label=label)
plt.title("Dependence of the model score on the number of features")
plt.xlabel("Numbel of features")
plt.ylabel("ROC AUC")
plt.legend()
plt.grid()
plt.show()
Permutation importance: the drop in the model's score when a single feature's values are randomly shuffled.
perm = PermutationImportance(clf, random_state=42).fit(X_train, y_train)
eli5.show_weights(perm, feature_names=X_data.columns.to_list())
feature_importances_by_permutation = np.abs(perm.feature_importances_)
show_feature_importances(feature_importances_by_permutation)
mean(|shap_values|) (Tree SHAP)
explainer = shap.TreeExplainer(clf, X_train, model_output="probability", feature_dependence="independent")
%time shap_values = explainer.shap_values(X_train)
feature_importances_by_shap = np.mean(np.abs(shap_values), axis=0)
show_feature_importances(feature_importances_by_shap, n_top=None)
The Morris method estimates global sensitivity by perturbing one feature at a time. See for details: https://en.wikipedia.org/wiki/Morris_method
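In short, the method averages absolute "elementary effects": for each feature $i$ and sampled point $x$,

$$ EE_i = \frac{f(x_1, \ldots, x_i + \Delta, \ldots, x_k) - f(x)}{\Delta}, \qquad \mu_i^* = \frac{1}{r}\sum_{j=1}^{r} \big| EE_i^{(j)} \big|, $$

and $\mu_i^*$ serves as the sensitivity score.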
sensitivity = MorrisSensitivity(predict_fn=clf.predict_proba, data=X_train, feature_names=columns)
sensitivity_global = sensitivity.explain_global(name="Global Sensitivity")
feature_importances_by_morris = sensitivity_global.data()["scores"]
show_feature_importances(feature_importances_by_morris)
The average training loss reduction gained when using a feature for splitting.
clf = XGBClassifier(max_depth=3, n_estimators=50, importance_type="gain")
clf.fit(X_train, y_train)
feature_importances_by_gain = clf.feature_importances_
show_feature_importances(feature_importances_by_gain, n_top=None)
The number of times a feature is used to split the data across all trees.
clf = XGBClassifier(max_depth=3, n_estimators=50, importance_type="weight")
clf.fit(X_train, y_train)
feature_importances_by_weight = clf.feature_importances_
show_feature_importances(feature_importances_by_weight, n_top=None)
The number of times a feature is used to split the data across all trees weighted by the number of training data points that go through those splits.
clf = XGBClassifier(max_depth=3, n_estimators=50, importance_type="cover")
clf.fit(X_train, y_train)
feature_importances_by_cover = clf.feature_importances_
show_feature_importances(feature_importances_by_cover, n_top=None)
scores = [calculate_feature_saturation(fi)
          for fi in [
              feature_importances_by_permutation,
              feature_importances_by_shap,
              feature_importances_by_morris,
              feature_importances_by_gain,
              feature_importances_by_weight,
              feature_importances_by_cover,
          ]
         ]
labels = ["permutation", "shap", "morris", "gain", "weight", "cover"]
colors = ["red", "blue", "green", "yellow", "pink", "grey"]
show_feature_saturation(scores, labels=labels, colors=colors)